1. Introduction and Historical Remarks

The Iterative Proportional Fitting Procedure (IPFP) is a commonly used algorithm for
maximum likelihood estimation in loglinear models. The simplicity of the algorithm and its
relation to the theory of loglinear models make it a useful tool, especially for the analysis of
cross-classified categorical data (q.v.) or contingency tables (q.v.).


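To illustrate the algorithm we consider a three-way table of independent Poisson counts,
$x = \{x_{ijk}\}$. Suppose we wish to fit the loglinear model of no-three-factor interaction
for the mean $m$, i.e. the model

$$\ln m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)}. \qquad (1)$$

The basic IPFP takes an initial table $m^{(0)}$, such that $\ln m^{(0)}$ satisfies the model
(typically we would use $m^{(0)}_{ijk} = 1$ for all $i$, $j$, and $k$) and sequentially scales the
current fitted table to satisfy the three sets of two-way margins of the observed table, $x$.
The $\nu$th iteration consists of three steps which form (a subscript $+$ denoting summation
over that index, and $m^{(0,3)} = m^{(0)}$):

$$\begin{aligned}
m^{(\nu,1)}_{ijk} &= m^{(\nu-1,3)}_{ijk} \, x_{ij+} / m^{(\nu-1,3)}_{ij+}, \\
m^{(\nu,2)}_{ijk} &= m^{(\nu,1)}_{ijk} \, x_{i+k} / m^{(\nu,1)}_{i+k}, \\
m^{(\nu,3)}_{ijk} &= m^{(\nu,2)}_{ijk} \, x_{+jk} / m^{(\nu,2)}_{+jk}.
\end{aligned} \qquad (2)$$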


(The first superscript refers to the iteration number, and the second to the step number within
iterations.) The algorithm continues until the observed and fitted margins are sufficiently close.
For a detailed discussion of convergence and some of the other properties of the algorithm,
see Bishop, Fienberg and Holland (1975) or Haberman (1974). A FORTRAN implementation of
the algorithm is given in Haberman (1972 and 1973). (See also the discussion of computer
programs for loglinear models in the entry Contingency Tables by Fienberg.)
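
The three scaling steps of one iteration can be sketched in a few lines of NumPy. This is only an
illustrative sketch, not the FORTRAN implementation cited above: the function name
`ipfp_no_three_factor`, the stopping tolerance, and the small 2 x 2 x 2 table are assumptions made
for the example, and all observed two-way margins are assumed to be positive.

```python
import numpy as np

def ipfp_no_three_factor(x, tol=1e-9, max_iter=1000):
    """Fit the no-three-factor-interaction model to a 3-way table x by IPFP."""
    x = np.asarray(x, dtype=float)
    m = np.ones_like(x)                      # starting table m = 1 satisfies the model
    for iteration in range(1, max_iter + 1):
        # Step 1: scale so the fitted (i, j) margin equals x_{ij+}
        m *= (x.sum(axis=2) / m.sum(axis=2))[:, :, None]
        # Step 2: scale so the fitted (i, k) margin equals x_{i+k}
        m *= (x.sum(axis=1) / m.sum(axis=1))[:, None, :]
        # Step 3: scale so the fitted (j, k) margin equals x_{+jk}
        m *= (x.sum(axis=0) / m.sum(axis=0))[None, :, :]
        # The (j, k) margin is now exact; check how far off the other two are.
        gap = max(np.abs(m.sum(axis=2) - x.sum(axis=2)).max(),
                  np.abs(m.sum(axis=1) - x.sum(axis=1)).max())
        if gap < tol:
            return m, iteration
    return m, max_iter

# Hypothetical 2 x 2 x 2 table of counts, used only for illustration.
x = np.array([[[10., 20.], [30., 40.]],
              [[15., 5.], [25., 35.]]])
m_hat, n_iter = ipfp_no_three_factor(x)
```

Because every step multiplies the table by a function of only two of the three indices, the
logarithm of the fitted table never acquires a three-factor term, whatever the number of iterations.
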
As a computational technique for adjusting tables of counts, the IPFP appears to have
been first described by Kruithof (1937) (see also Krupp (1979)) and then independently
formulated by Deming and Stephan (1940). They considered the problem of adjusting (or
raking) a table, $n = \{n_{ijk}\}$, of counts to satisfy some external information about the margins of
the table. Deming (1943, p. 107) gives an example of a cross-classification, by age and by state,
of white persons attending school in New England. The population cross-classification,
$N = \{N_{ij}\}$, is unknown but the marginal totals are known. In addition a sample,
$n = \{n_{ij}\}$, from the population is available. Deming and Stephan's aim was to find an
estimate $\hat{N}$ which satisfies the marginal constraints and minimizes the $\chi^2$-like distance

$$\sum_{i,j} (\hat{N}_{ij} - n_{ij})^2 / n_{ij}. \qquad (3)$$

Their erroneous solution (see Stephan (1942)) was the IPFP. Although the $\hat{N}$ produced by the
IPFP need not minimize (3), it does provide an approximate and easily calculated solution.
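
The raking problem can be sketched for a two-way table as follows. The sample counts and the
"known" population margins are hypothetical numbers chosen for illustration (they are not Deming's
data), and the helper names `rake` and `chi_square_like` are likewise assumptions. The sketch only
evaluates criterion (3) at the raked table; it makes no claim that this value is the minimum.

```python
import numpy as np

def rake(n, row_totals, col_totals, n_cycles=100):
    """Scale the sample table n until its margins match the given totals."""
    N_hat = np.asarray(n, dtype=float).copy()
    for _ in range(n_cycles):
        N_hat *= (row_totals / N_hat.sum(axis=1))[:, None]   # match row totals
        N_hat *= (col_totals / N_hat.sum(axis=0))[None, :]   # match column totals
    return N_hat

def chi_square_like(N, n):
    """Criterion (3): sum over cells of (N_ij - n_ij)^2 / n_ij."""
    return float(((N - n) ** 2 / n).sum())

n = np.array([[20., 30.], [10., 40.]])      # hypothetical sample counts
row_totals = np.array([600., 400.])         # hypothetical known population margins
col_totals = np.array([350., 650.])
N_hat = rake(n, row_totals, col_totals)
distance = chi_square_like(N_hat, n)
```

Raking rescales whole rows and columns, so the cross-product ratio of the sample table is carried
over unchanged to the adjusted table.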

Over twenty years after the work of Deming and Stephan, Darroch (1962) implicitly used
a version of the IPFP to find the maximum likelihood estimates in a contingency table but left
the details of the general algorithm unclear. Bishop (1967) was the first to show how the
IPFP could be used to solve the maximum likelihood estimation problem in multidimensional
tables. Some further history and other uses of the algorithm, including applications to
doubly-stochastic matrices (q.v.), are discussed in Fienberg (1970).

2.  A  Coordinate-free  Version  of  the  IPFP 

The basic IPFP is applicable to a class of models much more general than those described
solely in terms of margins of a multiway table. Consider an index set $\mathcal{J}$ with $J$ elements
and let $x$ be a table of observed counts which are realizations of independent Poisson random
variables with mean $m$. Further let $M$ be a linear subspace of $\mathbb{R}^J$ with a spanning set
$\{f_k : k = 1, 2, \ldots, K\}$ where each $f_k$ is a vector of zeros and ones. The calculation of
the maximum likelihood estimate $\hat{m}$ for the loglinear model

$$\ln(m) \in M,$$

begins by taking a starting table $m^{(0)}$ with $\ln(m^{(0)}) \in M$ ($m^{(0)} = 1$ will always
work), and sequentially adjusts the table to satisfy the "margins", i.e. $\langle f_k, x \rangle$ for
$k = 1, 2, \ldots, K$, the inner products of the data with the spanning vectors. The $\nu$th cycle of
the procedure takes the current estimate $m^{(\nu-1,K)} = m^{(\nu,0)}$ and forms

$$m^{(\nu,k)} = m^{(\nu,k-1)} \, \frac{\langle f_k, x \rangle}{\langle f_k, m^{(\nu,k-1)} \rangle} \, f_k + m^{(\nu,k-1)} (1 - f_k), \qquad k = 1, 2, \ldots, K, \qquad (4)$$

with products of vectors taken componentwise (i.e. it adjusts the current fitted table so that the
margin corresponding to $f_k$ is correct), to yield $m^{(\nu)} = m^{(\nu,K)}$. The maximum likelihood
estimate is $\lim_{\nu \to \infty} m^{(\nu)}$. If one wished to fit the log-affine model

$$\ln(m) \in t + M,$$

which is just the translation by $t$ of the loglinear model $M$, then using the IPFP with starting
values which satisfy this model (e.g. $m^{(0)} = \exp(t)$) leads to the MLE.
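
A minimal sketch of update (4), assuming the table is stored as a vector and the zero-one spanning
vectors are the rows of an array `F`. The function name `ipfp_zero_one`, the fixed number of
cycles, and the 2 x 2 example are assumptions made for illustration; passing a vector `t` gives the
log-affine variant by starting from exp(t). Every inner product of a spanning vector with the data
is assumed to be positive.

```python
import numpy as np

def ipfp_zero_one(x, F, t=None, n_cycles=200):
    """Cycle through update (4) for zero-one spanning vectors (the rows of F)."""
    x = np.asarray(x, dtype=float)
    F = np.asarray(F, dtype=float)
    # Start at m = 1 (loglinear model) or m = exp(t) (log-affine model).
    m = np.ones_like(x) if t is None else np.exp(np.asarray(t, dtype=float))
    for _ in range(n_cycles):
        for f in F:
            ratio = (f @ x) / (f @ m)          # <f_k, x> / <f_k, m>
            m = m * ratio * f + m * (1.0 - f)  # rescale only the cells where f_k = 1
    return m

# A 2 x 2 table stored as a vector (cells 11, 12, 21, 22); independence model.
F = np.array([[1, 1, 0, 0],    # row 1
              [0, 0, 1, 1],    # row 2
              [1, 0, 1, 0],    # column 1
              [0, 1, 0, 1]])   # column 2
x = np.array([10., 20., 30., 40.])
m_hat = ipfp_zero_one(x, F)                               # loglinear fit
m_affine = ipfp_zero_one(x, F, t=np.log([1., 2., 1., 1.]))  # log-affine fit
```

For the independence model the first cycle already reproduces the familiar estimate
(row total x column total / grand total), so `m_hat` is (12, 18, 28, 42) here.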

There are many ways to view this basic algorithm and many problems for which the IPFP
is of especial use. Although the basic algorithm is limited to linear manifolds, $M$, with
zero-one spanning sets, it is possible to generalize the method to work with any linear manifold.
We now look at some topics which relate to the algorithm or its generalizations.


3. Some Computational Properties

Common alternatives to the IPFP are versions of Newton's method or other algorithms
which use information about the second derivative of the likelihood function, or Hessian
(q.v.). While such methods have quadratic convergence properties compared to the linear
properties of the IPFP, and are often quite efficient (see e.g. Chambers (1977), Haberman
(1974), or Fienberg, Meyer and Stewart (1979)), they are of limited use for models of high
dimensionality. For example, the model of no-three-factor interaction in a 10 x 10 x 10 table
has 271 parameters and thus requires 1/2 x 271 x 272 = 36,856 numbers to represent the Hessian.
In contrast the IPFP requires only about 300 numbers (i.e. the three sets of two-way marginal
totals) in addition to the table itself. For many large contingency table problems the IPFP is
the most reasonable computational method in use. Of course, for problems with only a small
number of parameters Newton's method may be preferable, especially when the model is such that
the basic IPFP is not applicable. Newton's method also automatically produces an estimate of the
variance-covariance matrix of the parameters, but this is what requires all of the storage space.
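
The storage comparison can be checked with a few lines of arithmetic. The helper name
`no_three_factor_storage` is an assumption made for illustration; it counts the independent
u-terms of the model, the distinct entries of a symmetric Hessian, and the entries of the three
two-way margins.

```python
def no_three_factor_storage(I, J, K):
    """Parameter and storage counts for the no-three-factor model in an I x J x K table."""
    # Independent u-terms: constant, main effects, and two-factor interactions.
    n_params = (1 + (I - 1) + (J - 1) + (K - 1)
                + (I - 1) * (J - 1) + (I - 1) * (K - 1) + (J - 1) * (K - 1))
    # A symmetric matrix is determined by its upper triangle, diagonal included.
    hessian_entries = n_params * (n_params + 1) // 2
    # The IPFP only needs the three observed two-way margins.
    margin_entries = I * J + I * K + J * K
    return n_params, hessian_entries, margin_entries

print(no_three_factor_storage(10, 10, 10))   # (271, 36856, 300)
```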

It  is  well  known  that  the  IPFP  can  often  be  slow  to  converge.  Our  experience  is  that  it 
is  generally  restrictions  on  storage  rather  than  computational  time  which  limit  an  algorithm's 
usefulness.  Thus  slow  convergence,  while  disturbing  in  some  contexts,  is  not  necessarily  a 
crucial  property. 

As we have seen, the basic IPFP is very simple and can require little more than hand
calculation. The simplicity of the algorithm allows one to understand and use the mechanics
of the calculations to show theoretical results. A good example of this is the theory of
decomposable models (models with closed-form estimates) as developed by Bishop, Fienberg and
Holland (1975) or Haberman (1974). These models are closely related to the IPFP: a
fundamental theorem (Haberman (1974, p. 191)) says that for every decomposable model there
is an ordering of the margins such that the simple IPFP converges in one iteration.
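
As an illustration of one-iteration convergence, consider the model in a three-way table in which
the first and third factors are conditionally independent given the second, so that the margins to
be fitted are $x_{ij+}$ and $x_{+jk}$. The sketch below uses a hypothetical table and assumes
positive margins. It runs a single IPFP cycle and compares the result with the well-known
closed-form estimate $x_{ij+} x_{+jk} / x_{+j+}$.

```python
import numpy as np

# Hypothetical 2 x 2 x 2 table of counts, used only for illustration.
x = np.array([[[10., 20.], [30., 40.]],
              [[15., 5.], [25., 35.]]])

# One IPFP cycle from m = 1 over the two margins of the model.
m = np.ones_like(x)
m *= (x.sum(axis=2) / m.sum(axis=2))[:, :, None]     # fit x_{ij+}
m *= (x.sum(axis=0) / m.sum(axis=0))[None, :, :]     # fit x_{+jk}

# Closed-form estimate x_{ij+} x_{+jk} / x_{+j+}.
closed_form = (x.sum(axis=2)[:, :, None] * x.sum(axis=0)[None, :, :]
               / x.sum(axis=(0, 2))[None, :, None])

one_cycle_is_exact = np.allclose(m, closed_form)
```

After the second step both margins are matched at once, so further cycles leave the table unchanged.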

One of the ideas underlying the IPFP is to sequentially equate a vector of expected values
with the sufficient statistics of the model. The IPFP does this one dimension at a time but
there is no reason why several dimensions cannot be simultaneously adjusted. This idea
underlies the estimation scheme for partially decomposable graphical models outlined in
Darroch, Lauritzen and Speed (1980). They show that for many models it is possible to fit
certain subsets of the marginal totals and to combine the resulting partial estimates using a
direct formula. Their approach helps answer the question of which cyclic order the IPFP
should use in satisfying the marginal totals. The results of Darroch, Lauritzen and
Speed show that certain groupings and orderings are particularly advantageous.


4.  Generalizations  of  the  IPFP 

A limitation of the basic IPFP is that only certain types of models can be fit. We now
consider several methods for extending the IPFP to cover any loglinear model. For
multinomial and Poisson data the problems of maximizing the likelihood function and
minimizing the Kullback-Leibler information number (q.v.) can be considered as dual
problems which lead to the same estimates (see the entry Contingency Tables by Fienberg).
We now consider generalizations of the IPFP from both these points of view.

Haberman (1974) shows that, when viewed from the likelihood perspective, the IPFP is
just a version of the cyclic coordinate ascent method of functional maximization. To illustrate
Haberman's approach, we choose a fixed set of vectors which span the model space, $M$, and
then we maximize the likelihood along each of these directions in turn. Specifically, we
consider a set of vectors $F = \{f_k : k = 1, 2, \ldots, K\}$ which span $M$. If we denote the
log-likelihood by $l(m \mid x)$ and consider an initial estimate $m^{(0)}$ with $\ln(m^{(0)})$ in $M$,
then the algorithm proceeds by finding $m^{(i)}$ such that

$$\ln(m^{(i)}) = \ln(m^{(i-1)}) + \alpha_i f_k, \qquad i \equiv k \pmod{K},$$

where $\alpha_i$ is determined so as to increase the likelihood sufficiently. When $f_k$ is a vector
of zeros and ones

$$\alpha_i = \ln\left( \langle f_k, x \rangle / \langle f_k, m^{(i-1)} \rangle \right)$$

(i.e. the $\alpha_i$ corresponding to the IPFP adjustment maximizes the likelihood in this direction).
For arbitrary $f_k$ there is no direct estimate of $\alpha_i$ and we are left with a one dimensional
maximization problem.
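
The one dimensional problem can be sketched as follows. Solving it by Newton steps on the
derivative of the Poisson log-likelihood along the chosen direction is an assumption of this
sketch, not a method prescribed in the text, and the names `line_maximizer` and `cyclic_ascent`
are hypothetical. The example first checks the search against the closed-form IPFP step for a
zero-one vector, then fits a model whose spanning set contains a vector of scores.

```python
import numpy as np

def line_maximizer(x, m, f, n_newton=50):
    """Find alpha maximizing the Poisson log-likelihood along ln(m) + alpha * f."""
    alpha = 0.0
    for _ in range(n_newton):
        fitted = m * np.exp(alpha * f)
        score = f @ x - f @ fitted          # derivative with respect to alpha
        curvature = (f * f) @ fitted        # minus the second derivative
        alpha += score / curvature          # one Newton step
    return alpha

def cyclic_ascent(x, F, n_cycles=200):
    """Maximize the likelihood along each spanning vector (row of F) in turn."""
    x = np.asarray(x, dtype=float)
    m = np.ones_like(x)                     # ln(m) = 0 lies in the model space
    for _ in range(n_cycles):
        for f in np.asarray(F, dtype=float):
            m = m * np.exp(line_maximizer(x, m, f) * f)
    return m

x = np.array([10., 20., 30., 40.])          # hypothetical counts

# Zero-one direction: the line search should agree with the IPFP step.
f01 = np.array([1., 1., 0., 0.])
alpha_newton = line_maximizer(x, np.ones(4), f01)
alpha_closed = np.log((f01 @ x) / (f01 @ np.ones(4)))

# Spanning set with a vector that is not zero-one: constant plus linear score.
F = np.array([[1., 1., 1., 1.],
              [0., 1., 2., 3.]])
m_hat = cyclic_ascent(x, F)
```

At the fitted table the inner products of each spanning vector with the data and with the fit agree,
which is the likelihood equation for the model spanned by the rows of `F`.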

Csiszar (1975) considers the IPFP as a method for minimizing the Kullback-Leibler
information between two probability distributions. When specialized to distributions on finite
sets, Csiszar's methods yield a generalized IPFP. The algorithms which result from
Csiszar's work are dual algorithms to the cyclic ascent methods, except that now maximization
can be over entire subspaces of $M$ rather than just vectors. These methods yield powerful
theoretical tools and have been instrumental in finding new algorithms which combine some of
the advantages of both Newton's method and the IPFP (see Meyer (1981)).

The third generalization of the IPFP we consider is due to Darroch and Ratcliff (1972).
This algorithm, known as Generalized Iterative Scaling, was also developed from the
information theory perspective, but is not closely related to Csiszar's method. The calculations
are similar to those of the basic IPFP: a set of vectors $F$ which span $M$ is chosen and the
likelihood is increased (but not maximized) in each of these directions in turn. Each iteration
can require that the scaling factors be raised to arbitrary powers. These features combine to
make the algorithm expensive as it often takes many iterations to converge and each iteration
is complicated.
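
A minimal sketch in the spirit of Darroch and Ratcliff's scaling step, under stated assumptions:
the spanning vectors (rows of `F`) are non-negative and sum to one in every cell, the data are
expressed as proportions, the start is uniform, and all scaling factors are applied together in
each pass. The function name and the four-cell score example are hypothetical.

```python
import numpy as np

def generalized_iterative_scaling(p_obs, F, n_iter=500):
    """Scaling iteration for vectors F[k] >= 0 whose columns sum to one."""
    p_obs = np.asarray(p_obs, dtype=float)
    F = np.asarray(F, dtype=float)
    assert np.all(F >= 0) and np.allclose(F.sum(axis=0), 1.0)
    target = F @ p_obs                          # observed values <f_k, p_obs>
    p = np.full_like(p_obs, 1.0 / p_obs.size)   # uniform starting distribution
    for _ in range(n_iter):
        ratio = target / (F @ p)                # one scaling factor per f_k
        # Cell j is multiplied by the product over k of ratio[k] ** F[k, j].
        p = p * np.exp(F.T @ np.log(ratio))
    return p

x = np.array([10., 20., 30., 40.])              # hypothetical counts
score = np.array([0., 1., 2., 3.]) / 3.0        # scores rescaled to [0, 1]
F = np.vstack([score, 1.0 - score])             # columns sum to one
p_hat = generalized_iterative_scaling(x / x.sum(), F)
```

The fractional powers `F[k, j]` are what distinguish this step from the basic IPFP, where each
scaling factor is applied with power zero or one.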

For some problems it is possible to avoid the complications of the generalized IPFPs by
transforming the contingency table into a form where the basic IPFP can be used (see Meyer
(1981) for details and Fienberg, Meyer, and Wasserman (1981) for some examples). This can
result in a significant saving in the computational effort and recognition of some of the
theoretical advantages (e.g. closed-form estimates) associated with the IPFP. Fienberg and
Wasserman (1981, Fig. 1) present an example where the convergence rate can be substantially
improved by taking advantage of this transformation technique.


Related Entries: Categorical Data, Contingency Tables, Kullback-Leibler Information.